Abstract

This paper proposes GProp, a deep reinforcement learning algorithm for continuous policies
with compatible function approximation. The algorithm is based on two innovations.
Firstly, we present a temporal-difference based method for learning the gradient of the
value-function. Secondly, we present the deviator-actor-critic (DAC) model, which comprises
three neural networks that estimate the value function, its gradient, and determine
the actor's policy respectively.

We evaluate GProp on two challenging tasks: a contextual bandit problem constructed
from nonparametric regression datasets that is designed to probe the ability of reinforcement
learning algorithms to accurately estimate gradients; and the octopus arm, a challenging
reinforcement learning benchmark. GProp is competitive with fully supervised methods
on the bandit task and achieves the best performance to date on the octopus arm.
Keywords: policy gradient, reinforcement learning, deep learning, gradient estimation,
temporal difference learning


1. Introduction 

In reinforcement learning, an agent learns to maximize its discounted future rewards (Sutton 
and Barto, 1998). The structure of the environment is initially unknown, so the agent must 
both learn the rewards associated with various action-sequence pairs and optimize its policy. 
A natural approach is to tackle the subproblems separately via a critic and an actor (Barto
et al., 1983; Konda and Tsitsiklis, 2000), where the critic estimates the value of different
actions and the actor maximizes rewards by following the policy gradient (Sutton et al.,
1999; Peters and Schaal, 2006; Silver et al., 2014). Policy gradient methods have proven
useful in settings with high-dimensional continuous action spaces, especially when
task-relevant policy representations are at hand (Deisenroth et al., 2011; Levine et al., 2015;
Wahlstrom et al., 2015).

We tackle the problem of learning actor (policy) and critic representations. In the 
supervised setting, representation or deep learning algorithms have recently demonstrated 
remarkable performance on a range of benchmark problems. However, the problem of 
learning features for reinforcement learning remains comparatively underdeveloped. The
most dramatic recent success uses Q-learning over finite action spaces, and essentially builds
a neural network critic (Mnih et al., 2015). Here, we consider continuous action spaces, and
develop an algorithm that simultaneously learns the value function and its gradient, which 
it then uses to find the optimal policy. 

1.1 Outline 

This paper presents Value-Gradient Backpropagation (GProp), a deep actor-critic algorithm 
for continuous action spaces with compatible function approximation. Our starting point is 
the deterministic policy gradient and associated compatibility conditions derived in (Silver
et al., 2014). Roughly speaking, the compatibility conditions are that

C1. the critic approximates the gradient of the value-function and

C2. the approximation is closely related to the gradient of the policy. 

See Theorem 2 for details. We identify and solve two problems with prior work on policy 
gradients - relating to the two compatibility conditions: 

P1. Temporal difference methods do not directly estimate the gradient of the value function.
Instead, temporal difference methods are applied to learn an approximation of the
form $Q^v(s) + Q^w(s, a)$, where $Q^v(s)$ estimates the value of a state, given the current
policy, and $Q^w(s, a)$ estimates the advantage from deviating from the current policy
(Sutton et al., 1999; Peters and Schaal, 2006; Deisenroth et al., 2011; Silver et al.,
2014). Although the advantage is related to the gradient of the value function, it is not
the same thing.

P2. The representations used for compatible approximation scale badly on neural networks.
The second problem is that prior work has been restricted to advantage functions constructed
from a particular state-action representation, $\phi(s, a) = \nabla_\theta \mu_\theta(s) (a - \mu_\theta(s))$, that
depends on the gradient of the policy. The representation is easy to handle for linear
policies. However, if the policy is a neural network, then the standard state-action
representation ties the critic too closely to the actor and depends on the internal
structure of the actor, see Example 2. As a result, weight updates cannot be performed by
backpropagation, see section 5.5.

The paper makes three novel contributions. The first two contributions relate directly to
problems P1 and P2. The third is a new task designed to test the accuracy of gradient
estimates.

Method to directly learn the gradient of the value function. The first contribution
is to modify temporal difference learning so that it directly estimates the gradient of the 
value-function. The gradient perturbation trick, Lemma 3, provides a way to simultaneously 
estimate both the value of a function at a point and its gradient, by perturbing the function’s 
input with uncorrelated Gaussian noise. 

Plugging in a neural network instead of a linear estimator extends the trick to the 
problem of learning a function and its gradient over the entire state-action space. Moreover, 
the trick combines naturally with temporal difference methods, Theorem 5, and is therefore 
well-suited to applications in reinforcement learning. 
Deviator-Actor-Critic (DAC) model with compatible function approximation.

The second contribution is to propose the Deviator-Actor-Critic (DAC) model, Definition 2,
consisting of three coupled neural networks and Value-Gradient Backpropagation (GProp),
Algorithm 1, which backpropagates three different signals to train the three networks. The
main result, Theorem 6, is that GProp has compatible function approximation when
implemented on the DAC model, provided the neural networks consist of linear and rectilinear
units.^1

The proof relies on decomposing the Actor-network into individual units that are
considered as actors in their own right, based on ideas in (Srivastava et al., 2014; Balduzzi,
2015). It also suggests interesting connections to work on structural credit assignment in
multiagent reinforcement learning (Agogino and Turner, 2004, 2008; HolmesParker et al.,
2014).


Contextual bandit task to probe the accuracy of gradient estimates. A third
contribution, that may be of independent interest, is a new contextual bandit setting
designed to probe the ability of reinforcement learning algorithms to estimate gradients. A
supervised-to-contextual bandit transform was proposed in (Dudík et al., 2014) as a method
for turning classification datasets into K-armed contextual bandit datasets.

We are interested in the continuous setting in this paper. We therefore adapt their 
transform with a twist. The SARCOS and Barrett datasets from robotics have features 
corresponding to the positions, velocities and accelerations of seven joints and labels
corresponding to their torques. There are 7 joints in both cases, so the feature and label spaces
are 21 and 7 dimensional respectively. The datasets are traditionally used as regression
benchmarks labeled SARCOS1 through SARCOS7 where the task is to predict the torque
of a single joint - and similarly for Barrett. 

We convert the two datasets into two continuous contextual bandit tasks where the 
reward signal is the negative distance to the correct 7-dimensional label. The algorithm is
thus “told” that the label lies on a sphere in a 7-dimensional space. The missing information 
required to pin down the label’s position is precisely the gradient. For an algorithm to 
make predictions that are competitive with fully supervised methods, it is necessary to find 
extremely accurate gradient estimates. 
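
The reward construction can be sketched in a few lines. This is only an illustration: it assumes Euclidean distance, uses a synthetic random vector in place of a real SARCOS or Barrett torque label, and the name `bandit_reward` is hypothetical.

```python
import numpy as np

def bandit_reward(action, label):
    """Reward for the continuous bandit: negative Euclidean distance to the label."""
    return -np.linalg.norm(action - label)

rng = np.random.default_rng(0)
label = rng.normal(size=7)    # hidden 7-dimensional label (synthetic stand-in)
action = rng.normal(size=7)   # the learner's prediction

r = bandit_reward(action, label)

# The reward reveals only the radius of a sphere, centred on the action,
# on which the label lies.
radius = -r

# The unobserved gradient of the reward w.r.t. the action is the unit vector
# pointing from the action to the label. Radius plus direction pin down the label.
direction = (label - action) / radius
recovered = action + radius * direction
```

The last two lines show why the gradient is exactly the missing information: a learner that knew it could recover the label from a single reward.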


Experiments. Section 6 evaluates the performance of GProp on the contextual bandit 
problems described above and on the challenging octopus arm task (Engel et al., 2005).
We show that GProp is able to simultaneously solve seven nonparametric regression
problems without observing any labels - instead using the distance between its actions and the
correct labels. It turns out that GProp is competitive with recent fully supervised learning 
algorithms on the task. Finally, we evaluate GProp on the octopus arm benchmark, where 
it achieves the best performance reported to date. 


1. The proof also holds for maxpooling, weight-tying and other features of convnets. A description of how 
closely related results extend to convnets is provided in (Balduzzi, 2015). 
1.2 Related work

An early reinforcement learning algorithm for neural networks is REINFORCE (Williams, 
1992). A disadvantage of REINFORCE is that the entire network is trained with a single 
scalar signal. 

Our proposal builds on ideas introduced with deep Q-learning (Mnih et al., 2015), such
as replay. However, deep Q-learning is restricted to finite action spaces, whereas we are
concerned with continuous action spaces. 

Policy gradients were introduced in (Sutton et al., 1999) and have been used extensively
(Kakade, 2001; Peters and Schaal, 2006; Deisenroth et al., 2011). The deterministic policy
gradient was introduced in (Silver et al., 2014), which also proposed the algorithm COPDAC-Q.
The relationship between GProp and COPDAC-Q is discussed in detail in section 5.5. 

An alternate approach, based on the idea of backpropagating the gradient of the value 
function, is developed in (Jordan and Jacobs, 1990; Prokhorov and Wunsch, 1997; Wang 
and Si, 2001; Hafner and Riedmiller, 2011; Fairbank and Alonso, 2012; Fairbank et al.,
2013). Unfortunately, these algorithms do not have compatible function approximation in 
general, so there are no guarantees on actor-critic interactions. See section 5.5 for further 
discussion. 

The analysis used to prove compatible function approximation relies on decomposing the 
Actor neural network into a collection of agents corresponding to the units in the network. 
The relation between GProp and the difference-based objective proposed for multiagent 
learning (Agogino and Turner, 2008; HolmesParker et al., 2014) is discussed in section 5.4.

1.3 Notation 

We use boldface to denote vectors, subscripts for time, and superscripts for individual units 
in a network. Sets of parameters are capitalized ($\Theta$, W, V) when they refer to matrices or
to the parameters of neural networks. 

2. Deterministic Policy Gradients 

This section recalls previous work on policy gradients. The basic idea is to simultaneously 
train an actor and a critic. The critic learns an estimate of the value of different policies; the 
actor then follows the gradient of the value-function to find an optimal (or locally optimal) 
policy in terms of expected rewards. 

2.1 The Policy Gradient Theorem 

The environment is modeled as a Markov Decision Process consisting of state space $S \subset \mathbb{R}^m$,
action space $A \subset \mathbb{R}^d$, initial distribution $p_1(s)$ on states, stationary transition distribution
$p(s_{t+1} | s_t, a_t)$ and reward function $r : S \times A \to \mathbb{R}$. A policy is a function $\mu_\theta : S \to A$ from
states to actions. We will often add noise to policies, causing them to be stochastic. In this
case, the policy is a function $\mu_\theta : S \to \Delta_A$, where $\Delta_A$ is the set of probability distributions
on actions.

Let $p_t(s \to s', \mu)$ denote the distribution on states $s'$ at time $t$ given policy $\mu$ and
initial state $s$ at $t = 0$ and let $\rho^\mu(s') = \int_S \sum_{t=1}^{\infty} \gamma^{t-1} p_1(s) \, p_t(s \to s', \mu) \, ds$. Let
$r_t^\gamma = \sum_{\tau=t}^{\infty} \gamma^{\tau-t} r(s_\tau, a_\tau)$ be the discounted future reward. Define the

value of a state-action pair: $Q^{\mu_\theta}(s, a) = E[r_1^\gamma | S_1 = s, A_1 = a; \mu_\theta]$ and

value of a policy: $J(\mu_\theta) = E_{s \sim \rho^\mu, a \sim \mu_\theta}[Q^{\mu_\theta}(s, a)]$.

The aim is to find the policy $\theta^* := \arg\max_\theta J(\mu_\theta)$ with maximal value. A natural
approach is to follow the gradient (Sutton et al., 1999), which in the deterministic case can
be computed explicitly as

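Theorem 1 (policy gradient)

Under reasonable assumptions on the regularity of the Markov Decision Process the policy
gradient can be computed as

$\nabla_\theta J(\mu_\theta) = E_{s \sim \rho^\mu} [ \nabla_\theta \mu_\theta(s) \cdot \nabla_a Q^{\mu}(s, a)|_{a=\mu_\theta(s)} ]$.

Proof See (Silver et al., 2014).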
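
As a sanity check on the formula, the sketch below compares a Monte Carlo estimate of the deterministic policy gradient with a closed form. Everything in it is an assumption made for illustration: a one-step (bandit) problem whose state distribution does not depend on the policy, a toy value function $Q(s, a) = -(a - s)^2$, and a scalar linear policy $\mu_\theta(s) = \theta s$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
s = rng.uniform(-1.0, 1.0, size=200_000)   # states sampled from a fixed distribution

def grad_a_Q(s, a):
    # Action-gradient of the toy value function Q(s, a) = -(a - s)^2 (assumed).
    return -2.0 * (a - s)

grad_theta_mu = s          # d/dtheta of mu_theta(s) = theta * s
a = theta * s              # deterministic action chosen by the policy

# Theorem 1: average of grad_theta mu(s) * grad_a Q(s, a) at a = mu(s).
dpg_estimate = np.mean(grad_theta_mu * grad_a_Q(s, a))

# Closed form: J(theta) = -(theta - 1)^2 * E[s^2], with E[s^2] = 1/3 for s ~ U[-1, 1].
analytic = -2.0 * (theta - 1.0) / 3.0
```

The two quantities agree up to sampling error, which is the content of the theorem in this simplified setting.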


2.2 Linear Compatible Function Approximation 

Since the agent does not have direct access to the value function $Q^\mu$ it must instead learn
an estimate $Q^w \approx Q^\mu$. A sufficient condition for when plugging an estimate $Q^w(s, a)$ into
the policy gradient $\nabla_\theta J(\theta) = E[\nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s, a)|_{a=\mu_\theta(s)}]$ yields an unbiased estimator
was first proposed in (Sutton et al., 1999).

A sufficient condition in the deterministic setting is: 

Theorem 2 (compatible value function approximation)

The value-estimate $Q^w(s, a)$ is compatible with the policy gradient, that is

$\nabla_\theta J(\theta) = E [ \nabla_\theta \mu_\theta(s) \cdot \nabla_a Q^w(s, a)|_{a=\mu_\theta(s)} ]$,

if the following conditions hold:

C1. $Q^w$ approximates the value gradient:

The weights learned by the approximate value function must satisfy $w = \arg\min_{w'} \ell_{GE}(\theta, w')$,
where

$\ell_{GE}(\theta, w) := E [ \| \nabla_a Q^w(s, a)|_{a=\mu_\theta(s)} - \nabla_a Q^\mu(s, a)|_{a=\mu_\theta(s)} \|^2 ]$   (1)

is the mean-square difference between the gradient of the true value function $Q^\mu$ and
the approximation $Q^w$.

C2. $Q^w$ is policy-compatible:

The gradients of the value-function and the policy must satisfy

$\nabla_a Q^w(s, a)|_{a=\mu_\theta(s)} = \langle \nabla_\theta \mu_\theta(s), w \rangle$.   (2)

Proof See (Silver et al., 2014).


Having stated the compatibility condition, it is worth revisiting the problems that we 
propose to tackle in the paper. The first problem is to directly estimate the gradient of the 
value function, as required by Eq. (1) in condition C1. The standard approach used in the
literature is to estimate the value function, or the closely related advantage function, using 
temporal difference learning, and then compute the derivative of the estimate. The next 
section shows how the gradient can be estimated directly. 

The second problem relates to the compatibility condition on policy and value gradients 
required by Eq. (2) in condition C2. The only function approximation satisfying C2 that 
has been proposed is 

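Example 1 (standard value function approximation)

Let $\phi(s)$ be an m-dimensional feature representation on states and set
$\phi(s, a) := \nabla_\theta \mu_\theta(s) \cdot (a - \mu_\theta(s))$. Then the value function approximation

$Q^{v,w}(s, a) = \langle \phi(s, a), w \rangle + \langle \phi(s), v \rangle = (a - \mu_\theta(s))^T \cdot \nabla_\theta \mu_\theta(s)^T \cdot w + \phi(s)^T \cdot v$,

whose first term $(a - \mu_\theta(s))^T \cdot \nabla_\theta \mu_\theta(s)^T \cdot w$ is the advantage function,
satisfies condition C2 of Theorem 2.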

The approximation in Example 1 encounters serious problems when applied to deep 
policies, see discussion in section 5.5. 
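
For a linear policy the construction in Example 1 is easy to write down and check numerically. The following sketch assumes a policy $\mu_\theta(s) = \Theta s$ with $\theta$ the row-major flattening of $\Theta$; the dimensions and helper names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 2, 3                          # action and state dimensions (illustrative)
Theta = rng.normal(size=(d, k))      # linear policy mu_theta(s) = Theta @ s
w = rng.normal(size=d * k)           # one advantage weight per policy parameter
s = rng.normal(size=k)

def mu(s):
    return Theta @ s

def grad_theta_mu(s):
    # (d*k, d) matrix of partial derivatives of mu w.r.t. the flattened Theta.
    return np.kron(np.eye(d), s[None, :]).T

def advantage(s, a):
    phi = grad_theta_mu(s) @ (a - mu(s))   # phi(s, a) from Example 1
    return phi @ w

# Central finite differences of the advantage w.r.t. the action at a = mu(s).
h = 1e-6
a0 = mu(s)
fd_grad = np.array([(advantage(s, a0 + h * e) - advantage(s, a0 - h * e)) / (2 * h)
                    for e in np.eye(d)])

# Right-hand side of condition C2.
compatible = grad_theta_mu(s).T @ w
```

Here `grad_theta_mu` is available in closed form. For a deep policy the same matrix depends on every internal weight of the actor, which is the difficulty the text refers to.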

3. Learning Value Gradients 

In this section, we tackle the first problem by modifying temporal-difference (TD) learning 
so that it directly estimates the gradient of the value function. First, we develop a new
approach to estimating the gradient of a black-box function at a point, based on perturbing
the function with Gaussian noise. It turns out that the approach extends easily to learning
the gradient of a black-box function across its entire domain. Moreover, it is easy to combine 
with neural networks and temporal difference learning. 

3.1 Estimating the gradient of an unknown function at a point 

Gradient estimates have been intensively studied in bandit problems, where rewards (or 
losses) are observed but labels are not. Thus, in contrast to supervised learning where it 
is possible to compute the gradient of the loss, in bandit problems the gradient must be 
estimated. More formally, consider the following setup. 

Definition 1 (zeroth-order black-box)

A function $f : \mathbb{R}^d \to \mathbb{R}$ is a zeroth-order black-box if it can only be queried for
zeroth-order information. That is, User can request the value $f(x)$ of $f$ at any point $x \in \mathbb{R}^d$, but
cannot request the gradient of the function.

We use the shorthand black-box in what follows. 


Compatible Value Gradients for Deep Reinforcement Learning 


The black-box model for optimization was introduced in (Nemirovski and Yudin, 1983), 
see (Raginsky and Rakhlin, 2011) for a recent exposition. In those papers, a black-box 
consists in a first-order oracle that can provide both zeroth-order information (the value of 
the function) and first-order information (the gradient or subgradient of the function).

Remark 1 (reward function is a black-box; value function is not) 

The reward function $r(s, a)$ is a black box since Nature does not provide gradient
information. The value function $Q^{\mu_\theta}(s, a) = E[r_1^\gamma | S_1 = s, A_1 = a; \mu_\theta]$ is not even a black-box: it
cannot be queried directly since it is defined as the expected discounted future reward. It
is for this reason the gradient perturbation trick must be combined with temporal difference 
learning, see section 3.4. 

An important insight is that the gradient of an unknown function at a specific point
can be estimated by perturbing its input (Flaxman et al., 2005). For example, for small
$\delta > 0$ the gradient of $f : \mathbb{R}^d \to \mathbb{R}$ is approximately
$\nabla f(x)|_{x=\mu} \approx \frac{d}{\delta} \cdot E[f(\mu + \delta u) \cdot u]$ where the
expectation is over vectors $u$ sampled uniformly from the unit sphere.

The following lemma provides a simple method for estimating the gradient of a function
at a point based on Gaussian perturbations:

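Lemma 3 (gradient perturbation trick)

The gradient of differentiable $f : \mathbb{R}^d \to \mathbb{R}$ at $\mu \in \mathbb{R}^d$ is

$\nabla f(x)|_{x=\mu} = \lim_{\sigma^2 \to 0} \arg\min_{w \in \mathbb{R}^d} \min_{b \in \mathbb{R}} E_{\epsilon \sim N(0, \sigma^2 \cdot I_d)} [ ( f(\mu + \epsilon) - \langle w, \epsilon \rangle - b )^2 ]$.   (3)

Proof By taking sufficiently small variance, we can assume that $f$ is locally linear. Setting
$b = f(\mu)$ yields a line through the origin. It therefore suffices to consider the special case
$f(x) = \langle v, x \rangle$.

Setting

$w^* = \arg\min_{w \in \mathbb{R}^d} E_{\epsilon \sim N(0, \sigma^2 \cdot I_d)} [ \frac{1}{2} ( \langle w, \epsilon \rangle - \langle v, \epsilon \rangle )^2 ]$

we are required to show that $w^* = v$. The problem is convex, so setting the gradient to zero
requires solving $0 = E[\langle w - v, \epsilon \rangle \cdot \epsilon]$, which reduces to solving the set of linear equations

$\sum_{i=1}^{d} (w^i - v^i) E[\epsilon^i \epsilon^j] = (w^j - v^j) E[(\epsilon^j)^2] = (w^j - v^j) \cdot \sigma^2 = 0$ for all $j$.

The first equality holds since $E[\epsilon^i \epsilon^j] = 0$ for $i \neq j$. It follows immediately that $w^* = v$. ■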
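
The lemma translates directly into a least-squares fit. The sketch below replaces the expectation in Eq. (3) by an average over sampled perturbations and uses a small fixed variance instead of the limit; the test function, sample size and the name `perturbation_gradient` are assumptions made for illustration.

```python
import numpy as np

def perturbation_gradient(f, mu, sigma=1e-2, n_samples=2000, seed=0):
    """Fit f(mu + eps) ~ <w, eps> + b over Gaussian eps; w estimates the gradient."""
    rng = np.random.default_rng(seed)
    d = mu.shape[0]
    eps = rng.normal(0.0, sigma, size=(n_samples, d))
    values = np.array([f(mu + e) for e in eps])        # zeroth-order queries only
    design = np.hstack([eps, np.ones((n_samples, 1))])  # columns for w and for b
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return coef[:d], coef[d]

def f(x):
    # Smooth test function treated as a black box (illustrative choice).
    return np.sin(x[0]) + x[1] ** 2

mu = np.array([0.3, -1.0])
w_hat, b_hat = perturbation_gradient(f, mu)
true_grad = np.array([np.cos(0.3), -2.0])
```

The fitted slope `w_hat` approximates the gradient and the intercept `b_hat` approximates $f(\mu)$, so a single regression recovers both the value and the gradient.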


3.2 Learning gradients across a range 

The solution to the optimization problem in Eq. (3) is the gradient $\nabla f(x)$ of $f$ at a particular
$\mu \in \mathbb{R}^d$. The next step is to learn a function $G^W : \mathbb{R}^d \to \mathbb{R}^d$ that approximates the gradient
across a range of values.
More precisely, given a sample $\{x_i\}_{i=1}^{n} \sim P_X$ of points, we aim to find

$W^* := \arg\min_W \sum_{i=1}^{n} [ \| \nabla f(x_i) - G^W(x_i) \|^2 ]$.

The next lemma considers the case where $Q^V$ and $G^W$ are linear estimates, of the form
$Q^V(x) := \langle \phi(x), v \rangle$ and $G^W(x) = W \cdot \psi(x)$, for fixed representations $\phi : X \to \mathbb{R}^m$ and
$\psi : X \to \mathbb{R}^n$.

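Lemma 4 (gradient learning)

Let $f : \mathbb{R}^d \to \mathbb{R}$ be a differentiable function. Suppose that $\phi : X \to \mathbb{R}^m$ and $\psi : X \to \mathbb{R}^n$ are
representations such that there exists an m-vector $v^*$ and a $(d \times n)$-matrix $W^*$ satisfying
$f(x) = \langle \phi(x), v^* \rangle$ and $\nabla f(x) = W^* \cdot \psi(x)$ for all $x$ in the sample.

If we define loss function

$\ell(W, V, x, \sigma) = E_{\epsilon \sim N(0, \sigma^2 \cdot I_d)} [ ( f(x + \epsilon) - \langle G^W(x), \epsilon \rangle - Q^V(x) )^2 ]$

then

$W^* = \lim_{\sigma^2 \to 0} \arg\min_W \min_V E_{x \sim P_X} [ \ell(W, V, x, \sigma) ]$.

Proof Follows from Lemma 3. ■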



In short, the lemma reduces gradient estimation to a simple optimization problem given 
a good enough representation. Jumping ahead slightly to section 4, we ensure that our 
model has good enough representations by constructing two neural networks to learn them. 
The first neural network, $Q^V : \mathbb{R}^d \to \mathbb{R}$, learns an approximation to $f(x)$ that plays the
role of the baseline $b$. The second neural network, $G^W : \mathbb{R}^d \to \mathbb{R}^d$, learns an approximation
to the gradient.
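
With fixed linear representations the optimization in Lemma 4 is an ordinary least-squares problem, because $\langle W \psi(x), \epsilon \rangle$ is linear in $W$. The sketch below is an illustration under assumed choices: the function $f(x) = x_0^2 + 3 x_0 x_1$, the features `phi` and `psi`, and the noise level are hypothetical and picked so that the lemma's realizability assumption holds.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_pts, d = 0.05, 4000, 2

def f(x):
    # Test function with gradient [2*x0 + 3*x1, 3*x0] (illustrative choice).
    return x[:, 0] ** 2 + 3.0 * x[:, 0] * x[:, 1]

def phi(x):
    # Features such that f(x) = <phi(x), v*> with v* = [0, 1, 3].
    return np.stack([np.ones(len(x)), x[:, 0] ** 2, x[:, 0] * x[:, 1]], axis=1)

def psi(x):
    # Features such that grad f(x) = W* @ psi(x) with W* = [[2, 3], [3, 0]].
    return x

x = rng.uniform(-1.0, 1.0, size=(n_pts, d))
eps = rng.normal(0.0, sigma, size=(n_pts, d))
targets = f(x + eps)                         # one perturbed query per sample point

# <W psi(x), eps> equals the flattened dot product of (eps outer psi(x)) with W.
outer = np.einsum('bi,bj->bij', eps, psi(x)).reshape(n_pts, -1)
design = np.hstack([outer, phi(x)])
coef, *_ = np.linalg.lstsq(design, targets, rcond=None)

W_hat = coef[:d * d].reshape(d, d)           # gradient model
v_hat = coef[d * d:]                         # baseline (value) model
W_true = np.array([[2.0, 3.0], [3.0, 0.0]])
```

Note that each sample point is queried once, at a perturbed location; the gradient map is recovered from the way the residual correlates with the perturbation.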

3.3 Temporal difference learning 

Recall that $Q^\mu(s, a)$ is the expected value of a state-action pair given policy $\mu$. It is never
observed directly, since it is computed by discounting over future rewards. TD-learning is
a popular approach to estimating $Q^\mu$ through dynamic programming (Sutton and Barto,
1998). 

We quickly review TD-learning. Let $\phi : S \times A \to \mathbb{R}^m$ be a fixed representation. The
goal is to find a value-estimate

$Q^V(s, a) := \langle \phi(s, a), v \rangle$,

where $v$ is an m-dimensional vector, that is as close as possible to the true value function.
If the value-function were known, we could simply minimize the mean-square error with
respect to $v$:

$\ell_{MSE}(v) = E [ ( Q^\mu(s, a) - Q^V(s, a) )^2 ]$.
Unfortunately, it is impossible to minimize the mean-square error directly, since the
value-function is the expected discounted future reward, rather than the reward. That is, the
value function is not provided explicitly by the environment - not even as a black-box. The
Bellman error is therefore used as a substitute for the mean-square error:

$\ell_{BE}(v) = E [ ( r(s, a) + \gamma Q^V(s', \mu_\theta(s')) - Q^V(s, a) )^2 ]$,

where the bracketed difference is the TD-error $\delta$, the sum $r(s, a) + \gamma Q^V(s', \mu_\theta(s'))$ stands in
for $Q^\mu(s, a)$, and $s'$ is the state subsequent to $s$.

Let $\delta_t = r_t - Q^V(s_t, a_t) + \gamma Q^V(s_{t+1}, \mu_\theta(s_{t+1}))$ be the TD-error. TD-learning updates $v$
according to

$v_{t+1} \leftarrow v_t + \eta_t \cdot \delta_t \cdot \nabla_v Q^V(s_t, a_t) = v_t + \eta_t \cdot \delta_t \cdot \phi(s_t, a_t)$,   (4)

where $\eta_t$ is a sequence of learning rates. The convergence properties of TD-learning and
related algorithms have been studied extensively, see (Tsitsiklis and Roy, 1997; Dann et al.,
2014).
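
A minimal sketch of the update in Eq. (4) follows. It assumes a hypothetical two-state deterministic cycle under a fixed policy, one-hot features (so the features depend only on the state, since the action is always the policy's action), and a constant learning rate.

```python
import numpy as np

gamma, eta = 0.5, 0.1
rewards = np.array([1.0, 0.0])    # reward received when acting in state 0 or state 1
next_state = np.array([1, 0])     # deterministic two-state cycle under the fixed policy
phi = np.eye(2)                   # one-hot features phi(s, mu(s))

v = np.zeros(2)
s = 0
for t in range(2000):
    s_next = next_state[s]
    # TD-error: reward plus discounted bootstrap minus current estimate.
    delta = rewards[s] + gamma * (phi[s_next] @ v) - phi[s] @ v
    # Eq. (4): the gradient of a linear estimate w.r.t. v is the feature vector.
    v = v + eta * delta * phi[s]
    s = s_next

# Bellman fixed point: Q(0) = 1 + gamma * Q(1) and Q(1) = gamma * Q(0).
expected = np.array([1.0, gamma]) / (1.0 - gamma ** 2)
```

Because the chain is deterministic and the features are tabular, the iterates settle on the Bellman fixed point even with a constant step size.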


3.4 Temporal difference gradient (TDG) learning 

Finally, we apply temporal difference methods to estimate the gradient^2 of the value
function, as required by condition C1 of Theorem 2. We are interested in gradient
approximations of the form

$G^W(s, a, \epsilon) := \langle G^W(s, a), \epsilon \rangle = \langle W \cdot \psi(s, a), \epsilon \rangle$,

where $\psi : S \times A \to \mathbb{R}^n$ and $W$ is a $(d \times n)$-dimensional matrix. The goal is to find $W^*$
such that $G^{W^*}(s, a) \approx \nabla_\epsilon Q^\mu(s, a, \epsilon)|_{\epsilon=0} = \nabla_a Q^\mu(s, a)|_{a=\mu_\theta(s)}$ for all sampled state-action
pairs.

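It is convenient to introduce notation $Q^\mu(s, a, \epsilon) := Q^\mu(s, a + \epsilon)$ and shorthand
$\tilde{s} := (s, \mu_\theta(s))$. Then, analogously to the mean-square, define the perturbed gradient error:

$\ell_{PGE}(v, W; \sigma^2) = E_{\tilde{s}} E_{\epsilon} [ ( Q^\mu(\tilde{s}, \epsilon) - \langle G^W(\tilde{s}), \epsilon \rangle - Q^V(\tilde{s}) )^2 ]$.

Given a good enough representation, Lemma 4 guarantees that minimizing the perturbed
gradient error yields the gradient of the value function. Unfortunately, as discussed above,
the value function cannot be queried directly. We therefore introduce the Bellman gradient
error as a proxy

$\ell_{BGE}(v, W; \sigma^2) = E_{\tilde{s}} E_{\epsilon} [ ( r(\tilde{s}, \epsilon) + \gamma Q^V(\tilde{s}') - \langle G^W(\tilde{s}), \epsilon \rangle - Q^V(\tilde{s}) )^2 ]$,

where the bracketed term is the TDG-error $\xi$.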


2. Residual gradient (RG) and gradient temporal difference (GTD) methods were introduced in (Baird, 
1995; Sutton et al., 2009a, b). The similar names may be confusing. RG and GTD methods are TD 
methods derived from gradient descent. In contrast, we develop a TD-based approach to learning
gradients. The two approaches are thus complementary and straightforward to combine. However, in this
paper we restrict to extending vanilla TD to learning gradients. 
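Set the TDG-error as

$\xi_t = r(\tilde{s}_t, \epsilon) + \gamma Q^V(\tilde{s}_{t+1}) - \langle G^W(\tilde{s}_t), \epsilon \rangle - Q^V(\tilde{s}_t)$

and, analogously to Eq. (4), define the TDG-updates

$v_{t+1} \leftarrow v_t + \eta_t \cdot \xi_t \cdot \nabla_v Q^V(\tilde{s}_t) = v_t + \eta_t \cdot \xi_t \cdot \phi(\tilde{s}_t)$

$W_{t+1} \leftarrow W_t + \eta_t \cdot \xi_t \cdot \nabla_W G^W(\tilde{s}_t, \epsilon) = W_t + \eta_t \cdot \xi_t \cdot \epsilon \otimes \psi(\tilde{s}_t)$,

where $\epsilon \otimes \psi$ is the $(d \times n)$ matrix given by the outer product. We refer to $\xi \cdot \epsilon$ as the
perturbed TDG-error.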
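
A single TDG step can be written as a small pure function. The sketch assumes linear critic and deviator estimates with precomputed feature vectors; the function name `tdg_update`, the default `gamma` and `eta`, and the toy inputs are all hypothetical.

```python
import numpy as np

def tdg_update(v, W, phi_t, psi_t, phi_next, reward, eps, gamma=0.9, eta=0.1):
    """One temporal-difference-gradient step for linear critic and deviator.

    v: (m,) critic weights, W: (d, n) deviator weights,
    phi_t / phi_next: (m,) critic features at the current / next on-policy pair,
    psi_t: (n,) deviator features, eps: (d,) perturbation added to the action.
    """
    # TDG-error: bootstrapped target minus the critic and deviator predictions.
    xi = reward + gamma * (phi_next @ v) - (W @ psi_t) @ eps - phi_t @ v
    v_new = v + eta * xi * phi_t                   # critic follows xi
    W_new = W + eta * xi * np.outer(eps, psi_t)    # deviator follows xi * eps
    return v_new, W_new, xi

v0 = np.zeros(3)
W0 = np.zeros((2, 4))
phi_t = np.array([1.0, 0.5, -0.5])
phi_next = np.array([1.0, 0.0, 0.2])
psi_t = np.array([1.0, 2.0, 0.0, -1.0])
eps = np.array([0.1, -0.2])

v1, W1, xi = tdg_update(v0, W0, phi_t, psi_t, phi_next, reward=1.0, eps=eps)
```

With zero initial weights the TDG-error equals the reward, so the first step moves `v` along `phi_t` and `W` along the outer product of `eps` and `psi_t`, which makes the role of the perturbation in the deviator update explicit.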

The following extension theorem allows us to import guarantees from temporal-difference 
learning to temporal-difference gradient learning. 

Theorem 5 (zeroth to first-order extension) 

Guarantees on TD-learning extend to TDG-learning. 

The idea is to reformulate TDG-learning as TD-learning, with a slightly different reward 
function and function approximation. Since the function approximation is still linear, any 
guarantees on convergence for TD-learning transfer automatically to TDG-learning.

Proof First, we incorporate $\epsilon$ into the state-action pair. Define $\tilde{r}(s, a, \epsilon) := r(s, a + \epsilon)$ and

$\tilde{\psi}(s, a, \epsilon) = \epsilon \otimes \psi(s, a)$.

Second, we define a dot product on matrices of equal size by flattening them down to
vectors. More precisely, given two matrices $A$ and $B$ of the same dimension $(m \times n)$, define
the dot-product $\langle A, B \rangle = \sum_{i,j=1}^{m,n} A_{ij} B_{ij}$. It is easy to see that

$G^W(s, a, \epsilon) := \langle W \cdot \psi(s, a), \epsilon \rangle = \langle \tilde{\psi}(s, a, \epsilon), W \rangle$.

The TDG-error can then be rewritten as

$\xi_t = \tilde{r}(s, a, \epsilon) + \gamma Q^{V,W}(s', a', \epsilon') - Q^{V,W}(s, a, \epsilon)$

where $Q^{V,W}(s, a, \epsilon) = \langle \phi(s, a), v \rangle + \langle \tilde{\psi}(s, a, \epsilon), W \rangle$ is a linear function approximation.

If we are in a setting where TD-learning is guaranteed to converge to the value-function,
it follows that TDG-learning is also guaranteed to converge - since it is simply a different
linear approximation. Thus, $Q^\mu(\tilde{s}, \epsilon) \approx Q^V(\tilde{s}) + G^W(\tilde{s}, \epsilon)$ and the result follows by
Lemma 4. ■
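
The identity at the heart of the proof, that the deviator's prediction is linear in $W$ with features $\epsilon \otimes \psi$, can be checked numerically. The dimensions and random values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5
W = rng.normal(size=(d, n))
psi = rng.normal(size=n)      # deviator features psi(s, a)
eps = rng.normal(size=d)      # action perturbation

lhs = (W @ psi) @ eps                 # <W psi(s, a), eps>
psi_tilde = np.outer(eps, psi)        # eps (outer product) psi(s, a), a (d x n) matrix
rhs = np.sum(psi_tilde * W)           # flattened matrix dot product <psi_tilde, W>
```

Since the two expressions coincide, treating `psi_tilde` as an ordinary feature vector turns TDG-learning into TD-learning with a larger linear model.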


4. Algorithm: Value-Gradient Backpropagation 

This section presents our model, which consists of three coupled neural networks that learn 
to estimate the value function, its gradient, and the optimal policy respectively. 
Definition 2 (deviator-actor-critic)

The deviator-actor-critic (DAC) model consists of three neural networks:

• actor-network with policy $\mu_\Theta : S \to A \subset \mathbb{R}^d$;

• critic-network, $Q^V : S \times A \to \mathbb{R}$, that estimates the value function; and

• deviator-network, $G^W : S \times A \to \mathbb{R}^d$, that estimates the gradient of the value
function.

Gaussian noise is added to the policy during training resulting in actions $a = \mu_\Theta(s) + \epsilon$
where $\epsilon \sim N(0, \sigma^2 \cdot I_d)$. The outputs of the critic and deviator are combined as

$Q^{V,W}(s, \mu_\Theta(s), \epsilon) = Q^V(s, \mu_\Theta(s)) + \langle G^W(s, \mu_\Theta(s)), \epsilon \rangle$.

The Gaussian noise plays two roles. Firstly, it controls the explore/exploit tradeoff by 
controlling the extent to which Actor deviates from its current optimal policy. Secondly, it 
controls the “resolution” at which Deviator estimates the gradient. 

The three networks are trained by backpropagating three different signals. Critic,
Deviator and Actor backpropagate the TDG-error, the perturbed TDG-error, and Deviator's
gradient estimate respectively; see Algorithm 1. An explicit description of the weight
updates of individual units is provided in Appendix A.

Deviator estimates the gradient of the value-function with respect to deviations e from 
the current policy. Backpropagating the gradient through Actor allows us to estimate the
influence of Actor-parameters on the value function as a function of their effect on the 
policy. 

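Algorithm 1: Value-Gradient Backpropagation (GProp).

for rounds t = 1, 2, ..., T do

    Network gets state $s_t$, responds $a_t = \mu_{\Theta_t}(s_t) + \epsilon$, gets reward $r_t$

    Let $\tilde{s} := (s, \mu_\Theta(s))$.

    $\xi_t \leftarrow r_t + \gamma Q^{V_t}(\tilde{s}_{t+1}) - Q^{V_t}(\tilde{s}_t) - \langle G^{W_t}(\tilde{s}_t), \epsilon \rangle$   // compute TDG-error

    $\Theta_{t+1} \leftarrow \Theta_t + \eta_t^A \cdot \nabla_\Theta \mu_{\Theta_t}(s_t) \cdot G^{W_t}(\tilde{s}_t)$   // backpropagate $G^W$

    $V_{t+1} \leftarrow V_t + \eta_t^C \cdot \xi_t \cdot \nabla_V Q^{V_t}(\tilde{s}_t)$   // backpropagate $\xi$

    $W_{t+1} \leftarrow W_t + \eta_t^D \cdot \xi_t \cdot \nabla_W G^{W_t}(\tilde{s}_t) \cdot \epsilon$   // backpropagate $\xi \cdot \epsilon$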


Critic and Deviator learn representations suited to estimating the value function and its 
gradient respectively. Note that even though the gradient is a linear function at a point, it 
can be a highly nonlinear function in general. Similarly, Actor learns a policy representation. 
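
To make the three updates concrete, here is a deliberately small sketch of the loop on a one-dimensional contextual bandit. It is not the paper's implementation: the three networks are replaced by linear models with hand-picked features, the problem is one-step so the discounted bootstrap term is dropped ($\gamma = 0$), and the reward, learning rates and noise level are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.2               # standard deviation of the exploration noise
eta_c = eta_d = 0.5       # critic and deviator learning rates
eta_a = 0.01              # actor learning rate (assumed smaller)

theta = 0.0               # actor:    mu(s) = theta * s
v = np.zeros(2)           # critic:   Q(s~) = <v, [1, s^2]>
w = 0.0                   # deviator: G(s~) = w * s

for t in range(20_000):
    s = rng.uniform(-1.0, 1.0)
    eps = rng.normal(0.0, sigma)
    a = theta * s + eps                # perturbed action
    r = -(a - 0.5 * s) ** 2            # toy reward; the best action is 0.5 * s
    phi = np.array([1.0, s * s])

    xi = r - phi @ v - (w * s) * eps   # TDG-error with gamma = 0
    theta += eta_a * s * (w * s)       # actor: backpropagate deviator's gradient estimate
    v += eta_c * xi * phi              # critic: backpropagate xi
    w += eta_d * xi * eps * s          # deviator: backpropagate xi * eps
```

The actor never sees the reward or the critic directly; it only follows the deviator's gradient estimate, yet `theta` drifts toward the optimal slope 0.5.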

We set the learning rates of Critic and Deviator to be equal ($\eta^C = \eta^D$) in the experiments
in section 6. However, the perturbation e has the effect of slowing down and stabilizing 
Deviator updates: 

Remark 2 (stability) 

The magnitude of Deviator's weight updates depends on $\epsilon \sim N(0, \sigma^2 \cdot I_d)$ since they are
computed by backpropagating the perturbed TDG-error $\xi \cdot \epsilon$. Thus as $\sigma^2 \to 0$, Deviator's
learning rate essentially tends to zero. In general, Deviator learns more slowly than Critic.
This has a stabilizing effect on the policy since Actor is insulated from Critic - its weight
updates only depend (directly) on the output of Deviator.

5. Analysis: Deep Compatible Function Approximation 

Our main result is that the deviator’s value gradient is compatible with the policy gradient 
of each unit in the actor-network - considered as an actor in its own right: 

Theorem 6 (deep compatible function approximation) 

Suppose that all units are rectilinear or linear. Then for each Actor-unit in the
Actor-network there exists a reparametrization of the value-gradient approximator that
satisfies the compatibility conditions in Theorem 2.

The actor-network is thus a collection of interdependent agents that individually follow
the correct policy gradients. The experiments below show that they also collectively
converge on useful behaviors. 

Overview of the proof. The next few subsections prove Theorem 6. We provide a brief 
overview before diving into the details. 

Guarantees for temporal difference learning and policy gradients are typically based on 
the assumption that the value-function approximation is a linear function of the learned 
parameters. However, we are interested in the case where Actor, Critic and Deviator are 
all neural networks, and are therefore highly nonlinear functions of their parameters. The 
goal is thus to relate the representations learned by neural networks to the prior work on 
linear function approximations. 

To do so, we build on the following observation, implicit in (Srivastava et al., 2014):

Remark 3 (active submodels) 

A neural network of n linear and rectilinear units can be considered as a set of $2^n$ submodels,
corresponding to different subsets of units. The active submodel at time t consists of the
active units (that is, the linear units and the rectifiers that do not output 0).

The active submodel has two important properties: 

• it is a linear function from inputs to outputs, since rectifiers are linear when active, 
and 

• at each time step, learning only occurs over the active submodels, since only active 
units update their weights. 

The feedforward sweep of a rectifier network can thus be disentangled into two steps
(Balduzzi, 2015). The first step, which is highly nonlinear, applies a gating operation that selects
the active submodel - by rendering various units inactive. The second step computes the 
output of the neural network via matrix multiplication. It is important to emphasize that 
although the active submodel is a linear function from inputs to outputs, it is not a linear 
function of the weights. 
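
The two-step view can be illustrated on a small rectifier network. The sketch assumes one hidden rectifier layer without biases and a linear output layer; the layer sizes and random weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 4))    # hidden rectifier layer
W2 = rng.normal(size=(2, 6))    # linear output layer

def forward(x):
    return W2 @ np.maximum(0.0, W1 @ x)

x = rng.normal(size=4)

# Step 1 (nonlinear): gating selects the active units for this input.
gate = (W1 @ x > 0).astype(float)

# Step 2 (linear): the active submodel is a single matrix applied to the input.
A = W2 @ (gate[:, None] * W1)
output_via_submodel = A @ x
```

The matrix `A` is linear in the input but is a product of the two weight matrices, which is why the active submodel is not a linear function of the weights.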

The strategy of the proof is to decompose the Actor-network into an interacting collection
of agents, referred to as Actor-units. That is, we model each unit in the Actor-network as
an Actor in its own right. On each time step that an Actor-unit is active, it interacts
with the Deviator-submodel corresponding to the current active submodel of the
Deviator-network. The proof shows that each Actor-unit has compatible function approximation.

5.1 Error backpropagation on rectilinear neural networks 

First, we recall some basic facts about backpropagation in the case of rectilinear units. 
Recent work has shown that replacing sigmoid functions with rectifiers $S(x) = \max(0, x)$
improves the performance of neural networks (Nair and Hinton, 2010; Glorot et al., 2011;
Zeiler et al., 2013; Dahl et al., 2013).

Let us establish some notation. The output of a rectifier with weight vector $w$ is

$S_w(x) := S(\langle w, x \rangle) := \max(0, \langle w, x \rangle)$.

The rectifier is active if $\langle w, x \rangle > 0$. We use rectifiers because they perform well in
practice and have the nice property that units are linear when they are active. The rectifier
subgradient is the indicator function

$\mathbb{1}(x) := \nabla S(x) = 1$ if $x > 0$, and $0$ else.

Consider a neural network of n units, each equipped with a weight vector $w^j \in H_j \subset \mathbb{R}^{d_j}$.
Hidden units are rectifiers; output units are linear. There are n units in total. It is
convenient to combine all the weight vectors into a single object; let $W \subset H = \prod_{j=1}^{n} H_j \subset \mathbb{R}^N$
where $N = \sum_{j=1}^{n} d_j$. The network is a function $F^W : \mathbb{R}^m \to \mathbb{R}^d : x_{in} \mapsto F^W(x_{in}) =: x_{out}$.

The network has error function $E(x_{out}, y)$ with gradient $g = \nabla_{x_{out}} E$. Let $x^j$ denote the
output of unit j and $\phi^j(x_{in})$ denote its input, so that $x^j = S(\langle w^j, \phi^j(x_{in}) \rangle)$.
Note that $\phi^j$ depends on $W$ (specifically, the weights of lower units) but this is suppressed
from the notation.

Definition 3 (influence)

The influence of unit j on unit k at time t is $\pi_t^{j,k} := \partial x_t^k / \partial x_t^j$ (Balduzzi et al., 2015). The
influence of unit j on the output layer is the vector $\pi_t^j = (\pi_t^{j,k})_{k \in out}$.

The following lemma summarizes an analysis of the feedforward and feedback sweep of 
neural nets. 

Lemma 7 (structure of neural network gradients) 

The following properties hold 

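a. Influence.

A path is active at time t if all units on the path are firing. The influence of j on k is
the sum of products of weights over all active paths from j to k:

$$\pi^{j,k}_t = \sum_{\{\alpha | j \to \alpha\}} w^{j,\alpha} \mathbb{1}^{\alpha}_t \Big( \sum_{\{\beta | \alpha \to \beta\}} w^{\alpha,\beta} \mathbb{1}^{\beta}_t \Big( \cdots \sum_{\{\omega | \omega \to k\}} w^{\omega,k} \mathbb{1}^{k}_t \Big) \Big),$$

where $w^{j,\alpha}$ is the weight on the connection from j to $\alpha$, $\mathbb{1}^{\alpha}_t$ indicates whether unit $\alpha$ is
active at time t, and $\alpha, \beta, \ldots, \omega$ refer to units along the path from j to k.

b. Output decomposition.

The output of a neural network decomposes, relative to the output of unit j, as

$$F^W(x_{in}) = \pi^j \cdot x^j + \pi^{-j} \cdot x_{in},$$

where $\pi^{-j}$ is the $(m \times d)$-matrix whose $(ik)$-th entry is the sum over all active paths from
input unit i to output unit k that do not intersect unit j.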

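c. Output gradient.

Fix an input $x_{in} \in \mathbb{R}^m$ and consider the network as a function from parameters to outputs
$F^{\bullet}(x_{in}) : \mathcal{H} \to \mathbb{R}^d : W \mapsto F^W(x_{in})$ whose gradient is an $(N \times d)$-matrix. The $(ij)$-th entry
of the gradient is the input to the unit times its influence:

$$(\nabla_W F^W(x_{in}))_{ij} = \begin{cases} \phi^{ij}(x_{in}) \cdot \pi^j & \text{if unit } j \text{ is active} \\ 0 & \text{else,} \end{cases}$$

where $\phi^{ij}(x_{in})$ is the input that unit j receives from unit i.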


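d. Backpropagated error.

Fix $x_{in} \in \mathbb{R}^m$ and consider the function $\mathcal{E}(W) = \mathcal{E}(F^{\bullet}(x_{in}), y) : \mathcal{H} \to \mathbb{R} : W \mapsto \mathcal{E}(F^W(x_{in}), y)$.
Let $g = \nabla_{x_{out}} \mathcal{E}(x_{out}, y)$.

The gradient of the error function, for an active unit j, is

$$(\nabla_W \mathcal{E})_{ij} = \langle g, (\nabla_W F^W(x_{in}))_{ij} \rangle = \langle g, \pi^j \rangle \cdot \phi^{ij}(x_{in}) = \delta^j \cdot \phi^{ij}(x_{in}),$$

where the backpropagated error signal received by unit j decomposes as $\delta^j = \langle g, \pi^j \rangle$.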

Proof Direct computation. 


The lemma holds generically for networks of rectifier and linear units. We apply it to 
actor, critic and deviator networks below. 
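
To make Lemma 7c and 7d concrete, the following sketch checks them numerically on a tiny network with one hidden layer of rectifiers and linear outputs. The weights, the input and the error gradient `g` are arbitrary illustrative values, and the finite-difference check is our own addition rather than part of the paper.

```python
def relu(z):
    return max(0.0, z)


def forward(W1, W2, x):
    """One hidden layer of rectifiers followed by linear output units."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    out = [sum(w * h for w, h in zip(row, hidden)) for row in W2]
    return hidden, out


def influence(W2, j):
    """pi^j: with a single hidden layer, the only path from hidden unit j
    to output unit k is the direct edge, so pi^{j,k} = W2[k][j]."""
    return [row[j] for row in W2]


def output_gradient(W1, W2, x, i, j):
    """Lemma 7c: dF/dW1[j][i] = x[i] * pi^j if unit j is active, else 0."""
    hidden, _ = forward(W1, W2, x)
    if hidden[j] <= 0.0:
        return [0.0] * len(W2)
    return [x[i] * p for p in influence(W2, j)]


def finite_difference(W1, W2, x, i, j, h=1e-6):
    """Numerical derivative of the outputs with respect to W1[j][i]."""
    bumped = [row[:] for row in W1]
    bumped[j][i] += h
    _, base = forward(W1, W2, x)
    _, moved = forward(bumped, W2, x)
    return [(m - b) / h for m, b in zip(moved, base)]


W1 = [[0.5, -0.2], [-1.0, 0.3]]   # hidden unit 0 is active, unit 1 is not
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

pairs = [(i, j) for j in range(2) for i in range(2)]
analytic = [output_gradient(W1, W2, x, i, j) for i, j in pairs]
numeric = [finite_difference(W1, W2, x, i, j) for i, j in pairs]
max_gap = max(abs(a - n) for ra, rn in zip(analytic, numeric) for a, n in zip(ra, rn))

# Lemma 7d: backpropagated error delta^j = <g, pi^j> for the active unit 0.
g = [1.0, -2.0]                   # illustrative output-error gradient
delta_0 = sum(gk * pk for gk, pk in zip(g, influence(W2, 0)))
```

The analytic and numerical gradients agree up to floating-point error, and the inactive unit contributes zero gradient, as the lemma states.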

5.2 A minimal DAC model 

This subsection proves condition C1 of compatible function approximation for a minimal,
linear Deviator-Actor-Critic model. The next subsection shows how the minimal model
arises at the level of Actor-units.

Definition 4 (minimal model)

The minimal model of a Deviator-Actor-Critic consists of an Actor with linear policy
$\mu_\theta(s) = \langle \theta, \phi(s) \rangle + \epsilon$, where $\theta$ is an m-vector and $\epsilon$ is a noisy scalar. The Critic and
Deviator together output:

$$Q^{v,w}(s, \mu_\theta(s), \epsilon) = \underbrace{Q^v(s)}_{\text{Critic}} + \underbrace{G^w(\mu_\theta(s), \epsilon)}_{\text{Deviator}} = \langle \phi(s), v \rangle + \mu_\theta(s) \cdot \langle \epsilon, w \rangle,$$

where v is an m-vector, w is a scalar, and $\langle \epsilon, w \rangle$ is simply scalar multiplication.

The Critic in the minimal model is standard. However, the Deviator has been reduced
to almost nothing: it learns a single scalar parameter, w, that is used to train the actor.
The minimal model is thus too simple to be much use as a standalone algorithm.

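Lemma 8 (compatible function approximation for the minimal model)

There exists a reparametrization of the gradient estimate of the minimal model
$\tilde{G}^{\tilde{w}}(s, \epsilon) = G^w(\mu_\theta(s), \epsilon)$ such that compatibility condition C1 in Theorem 2 is satisfied:

$$\nabla_\epsilon \tilde{G}^{\tilde{w}}(s, \epsilon) = \langle \nabla_\theta \mu_\theta(s), \tilde{w} \rangle.$$

Proof Let $\tilde{w} := w \cdot \theta^\top$ and construct $\tilde{G}^{\tilde{w}}(s, \epsilon) := \langle \tilde{w} \cdot \phi(s), \epsilon \rangle$. Clearly,

$$\tilde{G}^{\tilde{w}}(s, \epsilon) = \langle w \cdot \theta^\top \cdot \phi(s), \epsilon \rangle = \mu_\theta(s) \cdot \langle w, \epsilon \rangle = G^w(\mu_\theta(s), \epsilon).$$

Observe that $\nabla_\epsilon \tilde{G}^{\tilde{w}}(s, \epsilon) = w \cdot \mu_\theta(s)$ and that, similarly,

$$\langle \nabla_\theta \mu_\theta(s), \tilde{w} \rangle = w \cdot \mu_\theta(s)$$

as required. ■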
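
The identity in Lemma 8 can be checked numerically. The sketch below uses arbitrary illustrative values for $\theta$, $\phi(s)$, $w$ and $\epsilon$, and works with the noise-free part $\langle \theta, \phi(s) \rangle$ of the policy; none of the numbers or names come from the paper.

```python
theta = [0.4, -0.3, 0.2]   # Actor weights (m = 3), illustrative values
phi = [1.0, 2.0, 0.5]      # state features phi(s), illustrative values
w = 1.5                    # the Deviator's single scalar parameter
eps = 0.25                 # scalar exploration noise

mu = sum(t * p for t, p in zip(theta, phi))   # noise-free <theta, phi(s)>
w_tilde = [w * t for t in theta]              # reparametrization w~ = w * theta


def G(mu_value, noise):
    """Minimal-model Deviator: G^w(mu, eps) = mu * <w, eps>."""
    return mu_value * w * noise


def G_tilde(noise):
    """Reparametrized form: <w~ . phi(s), eps>."""
    return sum(wt * p for wt, p in zip(w_tilde, phi)) * noise


# Left side of C1: derivative of G~ with respect to eps (G~ is linear in eps).
grad_eps = G_tilde(1.0) - G_tilde(0.0)
# Right side of C1: <grad_theta mu, w~>, where grad_theta mu = phi(s).
rhs = sum(p * wt for p, wt in zip(phi, w_tilde))
```

Both sides equal $w \cdot \mu_\theta(s)$, and the reparametrized estimate reproduces the original Deviator output.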


5.3 Proof of Theorem 6 

The proof proceeds by showing that the compatibility conditions in Theorem 2 hold for 
each Actor-unit. The key step is to relate the Actor-units to the minimal model introduced 
above. 

Lemma 9 (reduction to minimal model)

Actor-units in a DAC neural network are equivalent to minimal model Actors.

Proof Let $\pi^j_t$ denote the influence of unit j on the output layer of the Actor-network at
time t. When unit j is active, Lemma 7ab implies we can write
$\mu_{\Theta_t}(s_t) = \pi^j_t \cdot \langle \theta^j, \phi^j(s_t) \rangle + \mu_{\Theta^{-j}}(s_t)$,
where $\mu_{\Theta^{-j}}(s_t)$ is the sum over all active paths from the input to the output of
the Actor-network that do not intersect unit j.

Following Remark 3, the active subnetwork of the Deviator-network at time t is a linear
transform which, by abuse of notation, we denote by $W_t$.

Combine the last two points to obtain

$$G^{W_t}(s_t) = W_t \cdot \left( \pi^j_t \cdot \langle \theta^j, \phi^j(s_t) \rangle + \mu_{\Theta^{-j}}(s_t) \right)$$

$$= (W_t \cdot \pi^j_t) \cdot \langle \theta^j, \phi^j(s_t) \rangle + \text{terms that can be omitted.}$$

Observe that $(W_t \cdot \pi^j_t)$ is a d-vector. We have therefore reduced Actor-unit j's interaction
with the Deviator-network to d copies of the minimal model. ■


Theorem 6 follows from combining the above Lemmas.

Proof Compatibility condition C1 follows from Lemmas 8 and 9. Compatibility condition
C2 holds since the Critic and Deviator minimize the Bellman gradient error with respect
to W and V which also, implicitly, minimizes the Bellman gradient error with respect to
the corresponding reparametrized $\tilde{w}$'s for each Actor-unit. ■


Theorem 6 shows that each Actor-unit satisfies the conditions for compatible function 
approximation and so follows the correct gradient when performing weight updates. 

5.4 Structural credit assignment for multiagent learning 

It is interesting to relate our approach to the literature on multiagent reinforcement learning
(Guestrin et al., 2002; Agogino and Turner, 2004, 2008). In particular, HolmesParker et al.
(2014) consider the structural credit assignment problem within populations of interacting
agents: how should individual agents in a population be rewarded when the reward depends
on their collective behavior? They propose to train agents within populations with a
difference-based objective of the form

$$D_j = Q(z) - Q(z_{-j}, c_j) \qquad (5)$$

where Q is the objective function to be maximized; $z_j$ and $z_{-j}$ are the system variables
that are and are not under the control of agent j respectively, and $c_j$ is a fixed counterfactual
action.

In our setting, the gradient used by Actor-unit j to update its weights can be described 
explicitly: 

Lemma 10 (local policy gradients)

Actor-unit j follows policy gradient

$$\nabla J[\mu_{\theta^j}] = \mathbb{E}\left[ \nabla_{\theta^j} \mu_{\theta^j}(s) \cdot \langle \pi^j, G^W(s) \rangle \right],$$

where $\langle \pi^j, G^W(s) \rangle \approx D_{\pi^j} Q^{\mu}(s)$ is Deviator's estimate of the directional derivative of the
value function in the direction of Actor-unit j's influence.

Proof Follows from Lemma 7b. ■


Notice that $\nabla_{z_j} Q = \nabla_{z_j} D_j$ in Eq. (5), because the counterfactual term $Q(z_{-j}, c_j)$ does not
depend on $z_j$. It follows that training the Actor-network via GProp causes the Actor-units
to optimize the difference-based objective without having to compute the difference
explicitly. Although the topic is beyond the scope of the current paper, it is worth exploring
how suitably adapted variants of backpropagation can be applied to reinforcement learning
problems in the multiagent setting.
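
The observation that the global objective and the difference-based objective share the same gradient with respect to agent j's variable can be illustrated numerically. The objective `Q` below is a made-up quadratic chosen only for illustration, and the finite-difference helper is our own.

```python
def Q(z):
    """Hypothetical global objective over the joint variables z."""
    return -(z[0] - 1.0) ** 2 + z[0] * z[1] - 0.5 * z[2] ** 2


def difference_objective(z, j, c_j):
    """D_j = Q(z) - Q(z_{-j}, c_j): replace agent j's variable by a fixed c_j."""
    counterfactual = list(z)
    counterfactual[j] = c_j
    return Q(z) - Q(counterfactual)


def partial(f, z, j, h=1e-6):
    """Central finite difference of f with respect to z[j]."""
    up, down = list(z), list(z)
    up[j] += h
    down[j] -= h
    return (f(up) - f(down)) / (2 * h)


z = [0.3, -1.2, 0.8]
j, c_j = 1, 0.0
grad_Q = partial(Q, z, j)
grad_D = partial(lambda v: difference_objective(v, j, c_j), z, j)
```

The counterfactual term is constant in `z[j]`, so the two derivatives coincide.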

5.5 Comparison with related work 

Comparison with COPDAC-Q. Extending the standard value function approximation in
Example 1 to the setting where Actor is a neural network yields the following representation,
which is used in (Silver et al., 2014) when applying COPDAC-Q to the octopus arm task:

Example 2 (extension of standard value approximation to neural networks)

Let $\mu_\Theta : S \to \mathbb{R}^d$ and $Q^V : S \to \mathbb{R}$ be an Actor and Critic neural network respectively.
Suppose the Actor-network has N parameters (i.e. the total number of entries in $\Theta$). It
follows that the Jacobian $\nabla_\Theta \mu_\Theta(s)$ is an $(N \times d)$-matrix.

The value function approximation is then

$$Q^{V,w}(s, a) = \underbrace{(a - \mu_\Theta(s))^\top \cdot \nabla_\Theta \mu_\Theta(s)^\top \cdot w}_{\text{advantage function}} + \underbrace{Q^V(s)}_{\text{Critic}},$$

where w is an N-vector.

Weight updates under COPDAC-Q, with the function approximation above, are therefore
as described in Algorithm 2.

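Algorithm 2: Compatible Deterministic Actor-Critic (COPDAC-Q).

for rounds t = 1, 2, ..., T do
    Network gets state $s_t$, responds $a_t = \mu_{\Theta_t}(s_t) + \epsilon$ where $\epsilon \sim N(0, \sigma^2 \cdot I_d)$, gets reward $r_t$
    $\delta_t \leftarrow r_t + \gamma Q^{V_t}(s_{t+1}) - Q^{V_t}(s_t) - \langle \nabla_\Theta \mu_{\Theta_t}(s_t) \cdot \epsilon, w_t \rangle$
    $\Theta_{t+1} \leftarrow \Theta_t + \eta^A \cdot \nabla_\Theta \mu_{\Theta_t}(s_t) \cdot \nabla_\Theta \mu_{\Theta_t}(s_t)^\top \cdot w_t$
    $V_{t+1} \leftarrow V_t + \eta^C \cdot \delta_t \cdot \nabla_V Q^{V_t}(s_t)$
    $w_{t+1} \leftarrow w_t + \eta^D \cdot \delta_t \cdot \nabla_\Theta \mu_{\Theta_t}(s_t) \cdot \epsilon$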
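
For intuition, one round of the COPDAC-Q updates can be sketched for the special case of a linear Actor $\mu_\Theta(s) = \Theta s$ and a linear Critic $\langle v, s \rangle$, where the Jacobian products reduce to outer products with the state. This is a simplified illustration under those assumptions; the shared learning rate `lr`, the function name and the toy inputs are ours, and the paper's experiments use neural networks rather than linear models.

```python
def copdac_q_step(Theta, v, w, s, s_next, r, eps, gamma=0.95, lr=0.01):
    """One COPDAC-Q round for a linear Actor mu(s) = Theta s and Critic <v, s>.

    Theta and w are d x m nested lists (w has one entry per Actor parameter),
    v and s are m-vectors, eps is the d-vector of exploration noise.
    """
    d, m = len(Theta), len(s)

    def q(state):
        return sum(vi * si for vi, si in zip(v, state))

    # <grad_Theta mu . eps, w>: for a linear Actor the product grad_Theta mu . eps
    # is the outer product of eps and s.
    advantage = sum(w[k][i] * eps[k] * s[i] for k in range(d) for i in range(m))
    delta = r + gamma * q(s_next) - q(s) - advantage   # TD-error

    # grad_Theta mu^T . w is a d-vector; the Actor update multiplies it back by s.
    jw = [sum(w[k][i] * s[i] for i in range(m)) for k in range(d)]
    new_Theta = [[Theta[k][i] + lr * jw[k] * s[i] for i in range(m)] for k in range(d)]
    new_v = [v[i] + lr * delta * s[i] for i in range(m)]
    new_w = [[w[k][i] + lr * delta * eps[k] * s[i] for i in range(m)] for k in range(d)]
    return new_Theta, new_v, new_w, delta


zeros = [[0.0, 0.0], [0.0, 0.0]]
Theta, v, w, delta = copdac_q_step(
    Theta=zeros, v=[0.0, 0.0], w=zeros,
    s=[1.0, 2.0], s_next=[0.0, 0.0], r=1.0, eps=[0.5, -0.5],
)
```

Note that `w` must have exactly the same shape as `Theta`, which is the coupling between the advantage function and the policy parameters discussed in the comparison.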


Let us compare GProp with COPDAC-Q, considering the three updates in turn: 

• Actor updates. 

Under GProp, the Actor backpropagates the value-gradient estimate. In contrast,
under COPDAC-Q the Actor performs a complicated update that combines the policy
gradient $\nabla_\Theta \mu_\Theta(s)$ with the advantage function's weights, and differs substantively
from backprop.


• Deviator / advantage-function updates. 

Under GProp, the Deviator backpropagates the perturbed TDG-error. In contrast,
COPDAC-Q uses the gradient of the Actor to update the weight vector w of the
advantage function.

By Lemma 7d, backprop takes the form $g^\top \cdot \nabla_\Theta \mu_\Theta(s)$ where g is a d-vector. In
contrast, the advantage function requires computing $\nabla_\Theta \mu_\Theta(s)^\top \cdot w$, where w is an
N-vector. Although the two formulae appear superficially similar, they carry very
different computational costs.

The first consequence is that the parameters of w must exactly line up with those
of the policy. The second consequence is that, by Lemma 7c, the advantage function
requires access to

$$(\nabla_\Theta \mu_\Theta(s))_{ij} = \begin{cases} \phi^{ij}(s) \cdot \pi^j & \text{if unit } j \text{ is active} \\ 0 & \text{else,} \end{cases}$$

where $\phi^{ij}(s)$ is the input from unit i to unit j. Thus, the advantage function requires
access to the input $\phi^j(s)$ and the influence $\pi^j$ of every unit in the Actor-network.

• Critic updates.

The critic updates for the two algorithms are essentially identical, with the TD-error 
replaced with the TDG-error. 

In short, the approximation in Example 2 that is used by COPDAC-Q is not well adapted
to deep learning. The main reason is that learning the advantage function requires
coupling the vector w with the parameters $\Theta$ of the actor.

Comparison with computing the gradient of the value-function approximation. 

Perhaps the most natural approach to estimating the gradient is to simply estimate the value
function, and then use its gradient as an estimate of the derivative (Jordan and Jacobs, 1990;
Prokhorov and Wunsch, 1997; Wang and Si, 2001; Hafner and Riedmiller, 2011; Fairbank
and Alonso, 2012; Fairbank et al., 2013). The main problem with this approach is that,
to date, it has not been shown that the resulting updates of the Critic and the Actor are
compatible.

There are also no guarantees that the gradient of the Critic will be a good approximation
to the gradient of the value function, although it is intuitively plausible. The problem
becomes particularly severe when the value-function is estimated via a neural network that
uses activation functions that are not smooth, such as rectifiers. Rectifiers are becoming
increasingly popular due to their superior empirical performance (Nair and Hinton, 2010;
Glorot et al., 2011; Zeiler et al., 2013; Dahl et al., 2013).

6. Experiments 

We evaluate GProp on three tasks: two highly nonlinear contextual bandit tasks constructed 
from benchmark datasets for nonparametric regression, and the octopus arm. 

We do not evaluate GProp on other standard reinforcement learning benchmarks such 
as Mountain Car, Pendulum or Puddle World, since these can already be handled by linear 
actor-critic algorithms. The contribution of GProp is the ability to learn representations 
suited to nonlinear problems. 

Cloning and replay. Temporal difference learning can be unstable when run over a
neural network. A recent innovation introduced in (Mnih et al., 2015) that stabilizes
TD-learning is to clone a separate network to compute the targets $r_t + \gamma Q^{\tilde{V}}(s_{t+1})$. The
parameters of the cloned network are updated periodically.

We implement a similar modification of the TDG-error in Algorithm 1. We also use
experience replay (Mnih et al., 2015). GProp is well-suited to replay, since the critic and
deviator can learn values and gradients over the full range of previously observed
state-action pairs offline.

Cloning and replay were also applied to COPDAC-Q. Both algorithms were implemented
in Theano (Bergstra et al., 2010; Bastien et al., 2012).
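
The cloning idea can be sketched as follows. This is a minimal illustration assuming a linear critic; the class name, the cloning period `sync_every` and the toy values are hypothetical and do not reflect the settings used in the experiments.

```python
import copy


class ClonedTarget:
    """Keeps a frozen copy of the critic parameters for computing TD targets."""

    def __init__(self, params, sync_every):
        self.sync_every = sync_every          # hypothetical cloning period
        self.steps = 0
        self.frozen = copy.deepcopy(params)

    def target(self, reward, next_features, gamma=0.95):
        """r_t + gamma * Q(s_{t+1}), with Q evaluated by the frozen clone."""
        q_next = sum(p * f for p, f in zip(self.frozen, next_features))
        return reward + gamma * q_next

    def step(self, live_params):
        """Count one update and re-clone the live parameters periodically."""
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.frozen = copy.deepcopy(live_params)


live = [0.0, 0.0]
clone = ClonedTarget(live, sync_every=2)
live[0] = 1.0                                 # the live critic keeps learning
before_sync = clone.target(1.0, [1.0, 0.0])   # clone still holds the old weights
clone.step(live)
clone.step(live)                              # second step triggers re-cloning
after_sync = clone.target(1.0, [1.0, 0.0])
```

Between synchronizations the targets do not move with the live critic, which is what damps the feedback loop in TD-learning.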

6.1 Contextual Bandit Tasks 

The goal of the contextual bandit tasks is to probe the ability of reinforcement learning 
algorithms to accurately estimate gradients. The experimental setting may thus be of 
independent interest.

[Figure 1 panels: Contextual Bandit (SARCOS); Contextual Bandit (Barrett).]

Figure 1: Performance on contextual bandit tasks. The reward (negative normalized
test MSE) for 10 runs is shown together with the average (thick lines). Performance variation
for GProp is barely visible. Epochs refer to multiples of dataset; algorithms are ultimately
trained on the same number of random samples for both datasets.


Description. We converted two robotics datasets, SARCOS³ and Barrett WAM⁴, into
contextual bandit problems via the supervised-to-contextual-bandit transform in (Dudik
et al., 2014). The datasets have 44,484 and 12,000 training points respectively, both with 21
features corresponding to the positions, velocities and accelerations of seven joints. Labels
are 7-dimensional vectors corresponding to the torques of the 7 joints.

In the contextual bandit task, the agent samples 21-dimensional state vectors i.i.d.
from either the SARCOS or Barrett training data and executes 7-dimensional actions. The
reward $r(s, a) = -\|y(s) - a\|_2^2$ is the negative mean-square distance from the action to the
label. Note that the reward is a scalar, whereas the correct label is a 7-dimensional vector.
The gradient of the reward

$$\frac{1}{2} \nabla_a r(s, a) = y(s) - a$$

is the direction from the action to the correct label. In the supervised setting, the gradient
can be computed. In the bandit setting, the reward is a zeroth-order black box.

The agent thus receives far less information in the bandit setting than in the fully 
supervised setting. Intuitively, the negative distance r(s,a) “tells” the algorithm that the 
correct label lies on the surface of a sphere in the 7-dimensional action space that is centred 
on the most recent action. By contrast, in the supervised setting, the algorithm is given 
the position of the label in the action space. In the bandit setting, the algorithm must 
estimate the position of the label on the surface of the sphere. Equivalently, the algorithm 
must estimate the label’s direction relative to the center of the sphere - which is given by 
the gradient of the value function. 
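
The difference between the two settings can be made concrete with a short sketch. The label and action below are illustrative 3-dimensional vectors rather than the 7-dimensional torques of the datasets, and the function names are ours.

```python
def bandit_reward(label, action):
    """r(s, a) = -||y(s) - a||^2: the only feedback the bandit agent sees."""
    return -sum((y - a) ** 2 for y, a in zip(label, action))


def reward_gradient(label, action):
    """grad_a r(s, a) = 2 (y(s) - a): available only in the supervised setting."""
    return [2.0 * (y - a) for y, a in zip(label, action)]


label = [0.5, -1.0, 2.0]    # hypothetical label y(s)
action = [0.0, 0.0, 1.0]    # hypothetical action a
reward = bandit_reward(label, action)
half_gradient = [0.5 * g for g in reward_gradient(label, action)]
```

The bandit agent only ever observes `reward`, a single scalar; `half_gradient`, which points from the action to the label, is what it has to estimate.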


3. Taken from www.gaussianprocess.org/gpml/data/. 

4. Taken from http://www.ausy.tu-darmstadt.de/Miscellaneous/Miscellaneous.

The goal of the contextual bandit task is thus to simultaneously solve seven nonparametric
regression problems when observing distances-to-labels instead of directly observing
labels. The value function is relatively easy to learn in the contextual bandit setting since
the task is not sequential. However, both the value function and its gradient are highly
nonlinear, and it is precisely the gradient that specifies where labels lie on the spheres.

Network architectures. GProp and COPDAC-Q were implemented on an actor and deviator
network of two layers (300 and 100 rectifiers) each and a critic with two hidden layers of
100 and 10 rectifiers. Updates were computed via RMSProp with momentum. The variance
of the Gaussian noise $\sigma^2$ was set to decrease linearly from $\sigma^2 = 1.0$ until reaching $\sigma^2 = 0.1$,
at which point it remained fixed.

Performance. Figure 1 compares the test-set performance of policies learned by GProp 
against COPDAC-Q. The final policies trained by GProp achieved average mean-square test 
error of 0.013 and 0.014 on the seven SARCOS and Barrett benchmarks respectively. 

Remarkably, GProp is competitive with fully-supervised nonparametric regression
algorithms on the SARCOS and Barrett datasets, see Figure 2bc in (Nguyen-Tuong et al., 2008)
and the results in (Kpotufe and Boularias, 2013; Trivedi et al., 2014). It is important to
note that the results reported in those papers are for algorithms that are given the labels
and that solve one regression problem at a time. To the best of our knowledge, there are
no prior examples of a bandit or reinforcement learning algorithm that is competitive with
fully supervised methods on regression datasets.

For comparison, we implemented Backprop on the Actor-network under full supervision.
Backprop converged to 0.006 and 0.005 on SARCOS and Barrett, compared to 0.013 and
0.014 for GProp. Note that Backprop is trained on 7-dim labels whereas GProp receives
1-dim rewards.

Figure 2: Gradient estimates for contextual bandit tasks. The normalized MSE
of the gradient estimates against the true gradients (the squared error
$\|\nabla_{\text{est}} - \nabla_{\text{true}}\|_2^2$, normalized) is shown for 10 runs of COPDAC-Q and GProp, along with
their averages (thick lines).

Accuracy of gradient-estimates. The true value-gradients can be computed and
compared with the algorithm's estimates on the contextual bandit task. Fig. 2 shows the
performance of the two algorithms. GProp's gradient-error converges to < 0.005 on both tasks.
COPDAC-Q's gradient estimate, implicit in the advantage function, converges to 0.03
(SARCOS) and 0.07 (Barrett). This confirms that GProp yields significantly better gradient
estimates.

COPDAC-Q's estimates are significantly worse for Barrett compared to SARCOS, in line
with the worse performance of COPDAC-Q on Barrett in Fig. 1. It is unclear why COPDAC-Q's
gradient estimate gets worse on Barrett for some period of time. On the other hand, since
there are no guarantees on COPDAC-Q's estimates, its erratic behavior is perhaps not
surprising.

Comparison with bandit task in (Silver et al., 2014). Note that although the
contextual bandit problems investigated here are lower-dimensional (with 21-dimensional
state spaces and 7-dimensional action spaces) than the bandit problem in (Silver et al., 2014)
(with no state space and 10, 25 and 50-dimensional action spaces), they are nevertheless
much harder. The optimal action in the bandit problem, in all cases, is the constant
vector [4, ..., 4] consisting of only 4s. In contrast, SARCOS and Barrett are nontrivial
benchmarks even when fully supervised.

6.2 Octopus Arm 

The octopus arm task is a challenging environment that is high-dimensional, sequential and 
highly nonlinear. 

Description. The objective is to learn to hit a target with a simulated octopus arm (Engel
et al., 2005).⁵ Settings are taken from (Silver et al., 2014). Importantly, the action-space
is not simplified using "macro-actions". The arm has C = 6 compartments attached to a
rotating base. There are 50 = 8C + 2 state variables (x, y position/velocity of nodes along
the upper/lower side of the arm; angular position/velocity of the base) and 20 = 3C + 2
action variables controlling the clockwise and counter-clockwise rotation of the base and
three muscles per compartment.
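
The dimension counts follow directly from the number of compartments, as the short check below shows. The function name is ours and purely illustrative.

```python
def octopus_dimensions(compartments):
    """State and action counts for an arm with C compartments."""
    state_dim = 8 * compartments + 2     # node positions/velocities plus the base
    action_dim = 3 * compartments + 2    # three muscles per compartment plus base rotation
    return state_dim, action_dim


state_dim, action_dim = octopus_dimensions(6)
```

With C = 6 this gives the 50 state and 20 action dimensions quoted above.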

After each step, the agent receives a reward of $10 \cdot \Delta\text{dist}$, where $\Delta\text{dist}$ is the change in
distance between the arm and the target. The final reward is +50 if the agent hits the
target. An episode ends when the target is hit or after 300 steps.

The arm initializes at eight positions relative to the target: ±45°, ±75°, ±105°, ±135°. 
See Appendix B for more details. 

Network architectures. We applied GProp to an actor-network with 100 hidden
rectifiers and linear output units clipped to lie in [0, 1]; and critic and deviator networks both
with two hidden layers of 100 and 40 rectifiers, and linear output units. Updates were
computed via RMSProp with step rate of 10“^, with moving-average decay and Nesterov
momentum (Hinton et al., 2012) of 0.9 and 0.9 respectively, and discount rate $\gamma$ of
0.95.

5. Simulator taken from http://reinforcementlearningproject.googlecode.com/svn/trunk/FoundationsOfAI/octopus-arm-simulator/octopus/

Figure 3: Performance on octopus arm task. Ten runs of GProp and COPDAC-Q on a
6-segment octopus arm with 20 action and 50 state dimensions. Thick lines depict average
values. Left panel: number of steps/episode for the arm to reach the target. Right panel:
corresponding average rewards/step.


The variance of the Gaussian noise was initialized to $\sigma^2 = 1.0$. An explore/exploit
tradeoff was implemented as follows. When the arm hit the target in more than 300 steps,
we set $\sigma^2 \leftarrow \sigma^2 \cdot 1.3$; otherwise $\sigma^2 \leftarrow \sigma^2 / 1.3$. A hard lower bound was fixed at $\sigma^2 = 0.3$.
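
The noise schedule can be sketched as a single update rule applied after each episode. The function name, the parameter names and the sequence of episode lengths are illustrative; in particular, recording an unsuccessful episode as 301 steps is our assumption about how the "more than 300 steps" condition is triggered.

```python
def adapt_noise_variance(variance, steps_to_target, step_limit=300,
                         factor=1.3, floor=0.3):
    """Widen exploration after a slow episode, narrow it after a fast one."""
    if steps_to_target > step_limit:
        variance *= factor
    else:
        variance /= factor
    return max(variance, floor)      # hard lower bound on sigma^2


variance = 1.0
history = []
for steps in [301, 120, 80, 60, 45, 40, 38]:   # hypothetical episode lengths
    variance = adapt_noise_variance(variance, steps)
    history.append(variance)
```

Under this rule the variance shrinks geometrically while the arm keeps reaching the target and stops at the lower bound of 0.3.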

We implemented COPDAC-Q on a variety of architectures; the best results are shown
(see also Figure 3 in Silver et al., 2014). They were obtained using a similar
architecture to GProp, with sigmoidal hidden units and sigmoidal output units for the actor.
Linear, rectilinear and clipped-linear output units were also tried. As for GProp, cloning
and experience replay were used to increase stability.

Performance. Figure 3 shows the steps-to-target and average-reward-per-step on ten
training runs. GProp converges rapidly and reliably (within approximately 170,000 steps) to
a stable policy that uses less than 50 steps to hit the target on average (see supplementary
video for examples of the final policy in action). GProp converges more quickly, and to a
better solution, than COPDAC-Q. The reader is strongly encouraged to compare our results
with those reported in (Silver et al., 2014). To the best of our knowledge, GProp achieves the
best performance to date on the octopus arm task.

Stability. It is clear from the variability displayed in the figures that both the policy and
the gradients learned by GProp are more stable than those learned by COPDAC-Q. Note that
the higher variability exhibited by GProp in the right-hand panel of Fig. 3 (rewards-per-step)
is misleading. It arises because dividing by the number of steps, which is lower for GProp
since it hits the target more quickly after training, inflates GProp's apparent variability.

7. Conclusion 

Value-Gradient Backpropagation (GProp) is the first deep reinforcement learning algorithm
with compatible function approximation for continuous policies. It builds on the
deterministic actor-critic, COPDAC-Q, developed in (Silver et al., 2014) with two decisive
modifications. First, we incorporate an explicit estimate of the value gradient into the
algorithm. Second, we construct a model that decouples the internal structure of the actor,
critic, and deviator, so that all three can be trained via backpropagation.

GProp achieves state-of-the-art performance on two contextual bandit problems where it 
simultaneously solves seven regression problems without observing labels. Note that GProp 
is competitive with recent fully supervised methods that solve a single regression problem 
at a time. Further, GProp outperforms the prior state-of-the-art on the octopus arm task, 
quickly converging onto policies that rapidly and fluidly hit the target. 

Acknowledgements. We thank Nicolas Heess for sharing the settings of the octopus arm
experiments in (Silver et al., 2014).

Appendices


A. Explicit weight updates under GProp 

It is instructive to describe the weight updates under GProp more explicitly. 

Let $\theta^j$, $w^j$ and $v^j$ denote the weight vector of unit j, according to whether it belongs to
the actor, deviator or critic network. Similarly, in each case $\boldsymbol{\pi}^j$ or $\pi^j$ denotes the influence
of unit j on the network's output layer, where the influence is vector-valued for actor and
deviator networks and scalar-valued for the critic network.

Weight updates in the deviator-actor-critic model, where all three networks consist of
rectifier units performing stochastic gradient descent, are then per Algorithm 3. Units that
are not active on a round do not update their weights that round.


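Algorithm 3: GProp: Explicit updates.

for rounds t = 1, 2, ..., T do
    Network gets state $s_t$, responds $a_t = \mu_{\Theta_t}(s_t) + \epsilon$, gets reward $r_t$
    $\xi_t \leftarrow r_t + \gamma Q^{V_t}(s_{t+1}) - Q^{V_t}(s_t) - \langle G^{W_t}(s_t), \epsilon \rangle$    // compute TDG-error
    for unit j = 1, 2, ..., n do
        if j is an active actor unit then
            $\theta^j_{t+1} \leftarrow \theta^j_t + \eta^A_t \cdot \langle G^{W_t}(s_t), \boldsymbol{\pi}^j_t \rangle \cdot \phi^j_t(s_t)$    // backpropagate $G^{W_t}(s_t)$
        else if j is an active critic unit then
            $v^j_{t+1} \leftarrow v^j_t + \eta^C_t \cdot (\xi_t \cdot \pi^j_t) \cdot \phi^j_t(s_t)$    // backpropagate $\xi_t$
        else if j is an active deviator unit then
            $w^j_{t+1} \leftarrow w^j_t + \eta^D_t \cdot \langle \xi_t \cdot \epsilon, \boldsymbol{\pi}^j_t \rangle \cdot \phi^j_t(s_t)$    // backpropagate $\xi_t \cdot \epsilon$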
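
As an illustration of how the three updates fit together, the sketch below performs one GProp round for the simplest possible case: a linear Actor, Critic and Deviator, where every unit is an always-active output unit whose influence is a coordinate vector. The shared learning rate `lr`, the function name and the toy inputs are our assumptions; cloning and replay are omitted, and the networks in the experiments are deeper.

```python
def gprop_step(Theta, V, W, s, s_next, r, eps, gamma=0.95, lr=0.01):
    """One GProp round with linear Actor (Theta s), Critic <V, s>, Deviator (W s).

    Theta and W are d x m nested lists, V and s are m-vectors, and eps is the
    d-vector of exploration noise added to the Actor's output.
    """
    d, m = len(Theta), len(s)

    def q(state):
        return sum(vi * si for vi, si in zip(V, state))

    # Deviator output: the value-gradient estimate G(s), a d-vector.
    G = [sum(W[k][i] * s[i] for i in range(m)) for k in range(d)]
    # TDG-error: the TD-error minus the Deviator's predicted effect of the noise.
    xi = r + gamma * q(s_next) - q(s) - sum(g * e for g, e in zip(G, eps))

    # Actor backpropagates G, Critic backpropagates xi, Deviator backpropagates xi * eps.
    new_Theta = [[Theta[k][i] + lr * G[k] * s[i] for i in range(m)] for k in range(d)]
    new_V = [V[i] + lr * xi * s[i] for i in range(m)]
    new_W = [[W[k][i] + lr * xi * eps[k] * s[i] for i in range(m)] for k in range(d)]
    return new_Theta, new_V, new_W, xi


Theta, V, W, xi = gprop_step(
    Theta=[[0.0, 0.0], [0.0, 0.0]], V=[0.0, 0.0], W=[[1.0, 0.0], [0.0, 1.0]],
    s=[1.0, 2.0], s_next=[0.0, 0.0], r=1.0, eps=[0.5, -0.5],
)
```

Unlike the COPDAC-Q update, the Deviator's parameters `W` need not line up with the Actor's parameters `Theta`; the two happen to share a shape here only because both networks are linear maps from states to actions.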


B. Details for octopus arm experiments 

Listing 1 summarizes technical information with respect to the physical description and
task setting used in the octopus arm simulator in XML format.

Listing 1 Physical description and task setting for the octopus arm (setting.xml).

<constants>
    <frictionTangential>0.4</frictionTangential>
    <frictionPerpendicular>1</frictionPerpendicular>
    <pressure>10</pressure>
    <gravity>0.01</gravity>
    <surfaceLevel>5</surfaceLevel>
    <buoyancy>0.08</buoyancy>
    <muscleActive>0.1</muscleActive>
    <musclePassive>0.04</musclePassive>
    <muscleNormalizedMinLength>0.1</muscleNormalizedMinLength>
    <muscleDamping>-1</muscleDamping>
    <repulsionConstant>.01</repulsionConstant>
    <repulsionPower>1</repulsionPower>
    <repulsionThreshold>0.7</repulsionThreshold>
    <torqueCoefficient>0.025</torqueCoefficient>
</constants>

<targetTask timeLimit="300" stepReward="1">
    <target position="-3.25 -3.25" reward="50" />
</targetTask>